March 27, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
We’ll dig into generative models more over the final part of this class.
Just a quick overview/solution today
Recall that all machine learning is about learning a probability distribution
Most commonly supervised learning
Under some light assumptions:
\[ f(\mathbf x) = P(y | \mathbf x) \]
There’s also unsupervised learning
Different goal: Learn structure within the input under set of assumptions
Given a set of observations, \(\mathbf x_i \in \mathbf X\), learn about the structure of the inputs
\[ f(\mathbf x) = P(\mathbf x) \]
These methods are designed to learn structure in an input corpus
Clustering - find groups of inputs that are similar
PCA - project high dimensional data to a lower dimensional subspace that retains information
Density estimation - learn the structure of the data in terms of likelihood
Why learn \(f(\mathbf x)\) instead of \(f(y | \mathbf x)\)?
Structure: Choose an appropriate set of customers for different strategies based on clusters. (K-Means)
Description: Visualize a high dimensional feature set in 2 or 3 dimensions without losing info. (PCA)
Latent Variables: Find hidden structure within the data that corresponds to important concepts (Factor Analysis)
Prototypes: Find the most dog of all dogs and design supervised learning frameworks to match this info (Semi-supervised learning)
Another purpose. Dialogue from The Wire:
“I got the shotgun. You got the briefcase.”
“You ain’t a pawn in this game. You’re the whole damn chessboard.”
Another purpose. Dialogue from The Wire:
“I got the shotgun. You got the briefcase.” (Real)
“You ain’t a pawn in this game; you’re the whole damn chessboard.” (Fake by ChatGPT)
Generative Unsupervised Models learn from a set of inputs (text, images, tabular data)
The general strategy:
Assume that each training example is an i.i.d. draw from the true data generating distribution, \(\mathbf x_i \sim P(\mathbf x)\)
For text, over all words
For images, over all pixels
Given samples, pick the parameters, \(\boldsymbol \theta\), from a family of model distributions that minimize some notion of distance between the data we see and the model distribution
Then sample from the data generating distribution to get new viable feature values!
Get a new picture
Get a new set of words
This is actually the structure for common unsupervised learning models!
K-Means Clustering
Assume each \(\mathbf x_i\) conditioned on a latent class label follows a multivariate normal distribution
\[ f(\mathbf x_i | c_i = g) = \mathcal N_P(\mathbf x_i | \boldsymbol \mu_g , \boldsymbol \Sigma_g) \]
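As a sketch of this generative view of clustering, we can sample from a made-up two-cluster Gaussian mixture - draw a latent class label, then draw the observation from that class's normal (all parameters here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters: K = 2 clusters in P = 2 dimensions
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
pis = np.array([0.5, 0.5])  # prior over the latent class labels

def sample_one():
    """Draw a latent class g, then draw x | c = g from N(mu_g, Sigma_g)."""
    g = rng.choice(len(pis), p=pis)
    x = rng.multivariate_normal(mus[g], Sigmas[g])
    return g, x

samples = np.array([sample_one()[1] for _ in range(1000)])
```

Repeating the two-step draw many times produces a sample from the marginal \(f(\mathbf x)\), which is exactly the generation step described above.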
(Probabilistic) PCA
Assume each \(\mathbf x_i\) is generated as the product of a weight matrix, \(\underset{(P \times K)}{\boldsymbol \Lambda}\), and a latent variable, \(\underset{(K \times 1)}{\mathbf z_i}\)
\[ f(\mathbf x_i | \boldsymbol \Lambda , \mathbf z_i) = \mathcal N_P(\mathbf x_i | \boldsymbol \Lambda \mathbf z_i , \mathcal I_P) \]
For a new value of the latent variable ( \(c\) or \(\mathbf z\)), generate a new output as a draw from the model distribution
A new observation within the cluster dictated by \(c\)
A new observation given a latent vector, \(\mathbf z\), drawn from the overall multivariate normal!
That’s generation!
Note that our old methods, more or less, require that the data generating distribution be multivariate normal
Common mean
Common covariance matrix between features (pixel values/words)
Does this seem like a viable assumption?
Images and text have complex dependencies
Neighborhood effects
Context
Do you think that a multivariate normal distribution that is a linear combination of weights and hidden variables can effectively model the joint distribution of images?
Also, incredibly high dimensional
The iPhone 15 claims to be able to take pictures that are 6000 x 4000 pixels - super high clarity!
The total number of possible images we could see
\[ 256^{6000 \times 4000 \times 3} = 256^{72 \text{ million}} \]
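That number is far too large to compute directly, but we can count its digits on a log scale:

```python
import math

# Number of possible 8-bit RGB images at 6000 x 4000 resolution:
# 256 possible values per channel, 3 channels per pixel.
n_values = 6000 * 4000 * 3            # 72 million channel values
digits = n_values * math.log10(256)   # decimal digits in 256 ** n_values

print(f"256^{n_values:,} has about {digits:,.0f} decimal digits")
```

Roughly 173 million digits - compare that to the estimated \(10^{80}\) atoms in the observable universe, which has only 81 digits.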
DALL-E seems to be able to generate anything we want!
Fortunately, the world is structured!
We aren’t just a random collection of different colors
I’d be willing to venture that I’m the only person in the world right now typing the text “Peanut Butter, Egg, Dirt”
Some things are just unlikely to be viable outputs
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
Success:
Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!
Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!
Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?
All methods we’ll talk about can be sampled!
A final broad point:
We can also use these frameworks to learn the joint density of the inputs and any labels to do conditional generation
\[ P(\mathbf x | y) = \frac{P(y | \mathbf x) P(\mathbf x)}{P(y)} \]
If we learn the joint density of the inputs and any labels, then we can condition on the label
Train a classifier, multiply by the marginal input probability, and normalize by \(P(y)\) to get the posterior
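A toy discrete sketch of this Bayes-rule flip, with made-up numbers standing in for the classifier \(P(y | \mathbf x)\) and the input marginal \(P(\mathbf x)\):

```python
import numpy as np

# Toy discrete version of P(x | y) = P(y | x) P(x) / P(y).
# Three possible inputs x, two labels y; all probabilities are invented.
p_x = np.array([0.5, 0.3, 0.2])            # input marginal P(x)
p_y_given_x = np.array([[0.9, 0.1],        # classifier output P(y | x),
                        [0.4, 0.6],        # one row per input value
                        [0.2, 0.8]])

p_y = p_x @ p_y_given_x                    # P(y) = sum_x P(y | x) P(x)
p_x_given_y = (p_y_given_x * p_x[:, None]) / p_y  # Bayes rule, per label

# Each column of p_x_given_y is a valid conditional distribution over x
print(p_x_given_y)
```

Conditional generation then amounts to sampling \(\mathbf x\) from the column corresponding to the desired label \(y\).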
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
A collection of images
A collection of sentences
The easiest way to approach this is with a common probability identity.
Let \(\mathbf x\) be a vector of inputs - \([x_1,x_2,....,x_P]\)
The joint density over inputs is then:
\[ P(\mathbf x) = f(x_1,x_2,x_3,...,x_P) \]
The probability chain rule:
\[ f(x_1,x_2,x_3,...,x_P) = f(x_1)f(x_2 | x_1)f(x_3 | x_1,x_2) \cdots f(x_P | x_1,x_2,...,x_{P-1}) \]
If it is possible to learn the conditional density of the next input given the previous inputs, then we’ve learned the joint density of the inputs!
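As a minimal illustration of the chain rule, here is the joint probability of a toy word sequence under a made-up bigram (first-order Markov) model, where each conditional only looks back one step:

```python
# Chain-rule factorization on a toy sequence model. All probabilities
# below are invented for illustration.
p_first = {"the": 0.6, "a": 0.4}                 # f(x_1)
p_next = {("the", "cat"): 0.7, ("the", "dog"): 0.3,
          ("a", "cat"): 0.5, ("a", "dog"): 0.5,
          ("cat", "sat"): 1.0, ("dog", "sat"): 1.0}

def joint_prob(seq):
    """P(x) = f(x_1) * prod_j f(x_j | x_{j-1})."""
    p = p_first[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= p_next[(prev, cur)]
    return p

print(joint_prob(["the", "cat", "sat"]))  # 0.6 * 0.7 * 1.0 = 0.42
```

A real language model replaces the lookup table with a learned conditional over the full history, but the factorization of the joint density is identical.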
Does this look like anything we’ve recently talked about?
Generative Pretrained Transformers are autoregressive generators
Find the weights for masked self-attention that maximize the probability that we generate the correct next word
Learns \(P(\mathbf x)\) instead of \(P(y | \mathbf x)\)!
Technically, greedy decoding approximates \(\underset{\mathbf x}{\text{argmax}} P(\mathbf x)\)
Using GPT for classification is kind of a square peg in a round hole
A lot like K-Means + Logistic Regression for classification
Semi-supervised learning
ChatGPT is trained on an enormous fraction of all English text in existence
Conditional language generation follows the same principle:
Let \(\mathbf y\) be the prompt tokens
\[ P(\mathbf x | \mathbf y) = f(x_1 | \mathbf y)f(x_2 | \mathbf y , x_1)... \]
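A sketch of prompted autoregressive sampling with the same kind of made-up bigram conditionals - start from a prompt token and repeatedly draw the next token until a stop symbol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up next-token distributions; here the conditional only looks at
# the previous token, a stand-in for f(x_j | y, x_1, ..., x_{j-1}).
p_next = {"START": (["the", "a"], [0.6, 0.4]),
          "the": (["cat", "dog"], [0.7, 0.3]),
          "a": (["cat", "dog"], [0.5, 0.5]),
          "cat": (["sat", "END"], [0.8, 0.2]),
          "dog": (["sat", "END"], [0.8, 0.2]),
          "sat": (["END"], [1.0])}

def generate(prompt="START", max_len=10):
    """Sample tokens one at a time, conditioning on what came before."""
    tokens = [prompt]
    while tokens[-1] != "END" and len(tokens) < max_len:
        choices, probs = p_next[tokens[-1]]
        tokens.append(rng.choice(choices, p=probs))
    return tokens

print(generate())
```

GPT-style models do exactly this one-token-at-a-time loop, just with a transformer computing each conditional.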
At this point in time, this is the only real dog in the text generation fight
Easy to understand if you understand transformers
This one-ahead approach seems to work remarkably well for language generation
It's hard to see the next frontier in text generation beyond continued improvements to the GPT training process
Can we generate images this way?
Sorta!
The first proposed method was the PixelRNN (van den Oord et al., 2016)
Assume images are generated as rectangular pixel grids
Start in the top left corner, \(x_{11}\)
Take a draw from two conditional densities to get next two pixels, \(f(x_{12} | x_{11})\) and \(f(x_{21} | x_{11})\)
Get the next 4 pixels
Then 8…
The probability model uses the LSTM architecture.
Assume each RGB vector for each pixel depends on a hidden state that is determined recurrently from its neighbors
\[ \mathbf h_{x,y} = f(\mathbf h_{x-1,y}, \mathbf h_{x,y-1}) \]
Train over a large corpus to learn all of the hidden state transitions!
Implicitly depends on all previously seen pixels!
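A rough numpy sketch of this recurrence on a tiny grayscale grid. The weights are random stand-ins for learned parameters, and a Bernoulli pixel model replaces the full 256-way softmax over RGB values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the recurrence h_{x,y} = f(h_{x-1,y}, h_{x,y-1}) on a small
# binary image. W_left, W_up, and w_out are random stand-ins for weights
# a trained model would learn.
H, W, D = 4, 4, 8                       # grid size, hidden dimension
W_left = rng.normal(size=(D, D)) * 0.1
W_up = rng.normal(size=(D, D)) * 0.1
w_out = rng.normal(size=D)

h = np.zeros((H, W, D))
img = np.zeros((H, W))
for y in range(H):                      # raster order: top-left outward
    for x in range(W):
        left = h[y, x - 1] if x > 0 else np.zeros(D)
        up = h[y - 1, x] if y > 0 else np.zeros(D)
        h[y, x] = np.tanh(W_left @ left + W_up @ up + 1.0)
        # sample the pixel from a Bernoulli set by the hidden state
        p = 1 / (1 + np.exp(-w_out @ h[y, x]))
        img[y, x] = rng.random() < p
```

Because each hidden state is built from its left and upper neighbors, every pixel implicitly conditions on all pixels generated before it.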
This works quite well in some situations!
What do you think are the drawbacks to this approach for image generation?
Advancements in recurrent image generation:
Pixel CNN - exchange recurrence for CNN style windowing; a little faster
ImageGPT - exchange recurrence for masked self-attention; a little faster with really large training sets
These do a great job when we can compute them in finite time!
Painfully slow!!!
ImageGPT has only been shown to work for up to 64 x 64 input images
Literal days of training time to learn joint densities for small-ish data sets
Can’t be used generally to create a dictionary of all images!
What autoregressive methods do well:
Explicit density estimation - given a starting prompt/word, we can compute the joint probability of the next word directly
Fast at generation (for text)
What autoregressive methods don’t do well:
Images (too high dimensional)
No feature representation
What can we do to generate images?
The rest of the class will be on image/video generation!
These methods can be used for text (or other 1d sequential data like sound waves)
Autoregressive models are just the dominant method for text generation at this point in time!
Also, very useful for descriptive purposes!
A classic in the unsupervised literature is principal components analysis.
A more general statement:
Given a corpus of inputs, find a low dimensional representation of the inputs that minimizes reconstruction error
\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - d(e(\mathbf x_i))\|^2_2 \]
\(e(\mathbf x_i) = \mathbf z_i\) is an encoder function that maps the input to a latent space
\(d(\mathbf z_i)\) is a decoder function that maps the latent variable back to original feature space
How we choose to find \(e()\) and \(d()\) dictates the method
PCA
Given a \(N \times P\) matrix of input features, find a single weight matrix, \(\mathbf W\), with column rank \(K\) that minimizes the objective function:
\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf W^T \mathbf x_i\|^2_2 \]
If the input features are correlated, then we don’t need to see them all individually to reconstruct the input
Find a lower dimensional representation of the input vector that retains most of the information
\(\mathbf W\) is a \(P \times K\) matrix of weights, where \(K \ll P\), that controls both the mapping of the input to the latent space and the mapping of the latent space back to the input space
\(\mathbf z_i\) is a \(K\)-dimensional latent representation of the input that is derived through a linear mapping - \(\mathbf W^T \mathbf x_i\)
\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf z_i \|^2_2 \]
Two additional constraints:
The columns of \(\mathbf W\) should be orthogonal directions
The columns of \(\mathbf W\) should be normalized
PCA operates by using the eigenvalues and eigenvectors of the (unnormalized) covariance matrix of the centered feature matrix:
\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^T \]
where \(\mathbf Q\) is the collection of \(P\) eigenvectors ordered by eigenvalue (from large to small)
To create a rank \(K\) approximation of the original feature matrix, set \(\mathbf W = \mathbf Q_{1:K}\), the first \(K\) columns of \(\mathbf Q\)
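The eigendecomposition recipe takes only a few lines of numpy. The data here is simulated low-rank-plus-noise, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-K PCA reconstruction via the eigendecomposition of X^T X.
N, P, K = 200, 5, 2
Z_true = rng.normal(size=(N, K))                      # hidden factors
X = Z_true @ rng.normal(size=(K, P)) + 0.05 * rng.normal(size=(N, P))
X = X - X.mean(axis=0)                                # center the features

evals, Q = np.linalg.eigh(X.T @ X)                    # symmetric eigendecomp
order = np.argsort(evals)[::-1]                       # large -> small
W = Q[:, order[:K]]                                   # first K eigenvectors

X_hat = X @ W @ W.T                                   # z_i = W^T x_i, then decode
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(f"mean squared reconstruction error: {err:.4f}")
```

Because the data is essentially rank 2 plus small noise, the rank-2 reconstruction error lands near the noise floor.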
What’s neat about PCA is that it produces the best orthogonal rank-K approximation to the input under linearity and squared reconstruction error
No other method (under these constraints) will do better!
This is a consequence of the Eckart–Young theorem
Why this construction?
It admits a nice analytical solution with good properties.
Can be solved via a basic computer that can find eigenvectors
Computational constraints led to the solution - no other particularly good reason!
Using PCA, we can:
Get rid of redundant features
Reduce dimensionality
Parse away noise
Pass latent scores, \(\mathbf Z\), to supervised learning procedures
Additionally, the dimensions of the latent space correspond to factors of variation within the training set!
A key component of the data scientist’s toolkit!
Problem: Linearity and strictly invertible mappings are restrictive
Think of PCA as creating a mapping from the feature space to a latent space and then inverting the mapping to return an approximation to the original feature space
PCA seeks to solve the generic problem of reconstruction:
\[ \|\mathbf X - d(e(\mathbf X))\|^2_2 \]
This is referred to as a deterministic bottleneck autoencoder
Learn a set of encoder and decoder functions that map \(\mathbf X\) to itself!
Restriction: each input instance passes through a low-dimensional bottleneck ( \(K \ll P\) )
The model can't simply copy its input, so it must construct \(\mathbf Z\) to represent as much of the variation in \(\mathbf X\) as possible
Note that PCA is a special case!
No good reason to restrict ourselves to the linear, analytical solution when we have autodiff that can solve for arbitrarily complex models
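A minimal bottleneck autoencoder sketch: a linear, untied encoder/decoder pair trained by plain gradient descent on the reconstruction loss. A real autoencoder would swap in nonlinear networks and autodiff; with linear maps this recovers the same subspace as PCA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear bottleneck autoencoder trained on the reconstruction loss
# (1/N) sum ||x_i - d(e(x_i))||^2 with e(x) = We^T x, d(z) = Wd^T z.
N, P, K = 200, 5, 2
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, P))  # exactly rank-K data
X = X - X.mean(axis=0)

We = rng.normal(size=(P, K)) * 0.1   # encoder weights: Z = X @ We
Wd = rng.normal(size=(K, P)) * 0.1   # decoder weights: X_hat = Z @ Wd
lr = 0.01
for _ in range(5000):
    Z = X @ We                       # encode through the bottleneck
    R = Z @ Wd - X                   # reconstruction residuals
    grad_Wd = Z.T @ R / N            # gradients of the squared loss
    grad_We = X.T @ (R @ Wd.T) / N
    We -= lr * grad_We
    Wd -= lr * grad_Wd

loss = np.mean(np.sum((X - X @ We @ Wd) ** 2, axis=1))
print(f"final reconstruction loss: {loss:.4f}")
```

Since the simulated data is exactly rank \(K\), gradient descent drives the reconstruction loss to (near) zero; nothing in the loop changes if \(e\) and \(d\) become deep networks.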
How do you choose the size of the bottleneck?
This largely echoes the same discussion for PCA
Just choose a small number and see how well we’re able to recover the images
Stop adding dimensions when the next one doesn’t reduce the loss too much
Use validation error to choose - though this requires serious tuning, so it's typically skipped
Common choices: 2 for visualization, or 16/32/64/128/etc. for images
A few notes:
Everyone understands PCA, not necessarily autoencoders. Follow disciplinary norms.
For images, the CNN backbone helps to induce structure in the recovered images! The reconstruction loss is about the same as a deep feedforward autoencoder in a lot of situations, but it helps to find structure for recovery.
For images, autoencoders strictly dominate PCA!
What can we do with deterministic autoencoders?
Like transformers, the model can be split into parts for different tasks.
Encoder ( \(\mathbf X \rightarrow \mathbf Z\) )
Use as you would for PCA. Pass lower dimensional representation to supervised model for better predictive performance.
Use to examine feature maps to understand sources of shared variation within input features.
The decoder is the interesting part!
Decoder ( \(\mathbf Z \rightarrow \mathbf X\) ):
In theory, \(\mathbf Z\) shares most of the same information as \(\mathbf X\).
The lower dimensional representation is more manageable to work with.
The decoder dictates a mapping from \(\mathbf Z\) to \(\mathbf X\)
Any thoughts?
Given an input \(\mathbf z\), we could use that to translate back up to the feature space!
How do we choose \(\mathbf z\) to be viable?
Random dart throws?
Pick a point in the convex hull of the latent space defined by the training data?
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
Success:
Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!
Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!
Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?
This approach doesn’t generate viable images
This isn’t a proper probabilistic model!
There isn’t ever an expression of \(P(\mathbf X)\)
The method can’t learn how to fill in the gaps between digits appropriately because we never asked it to do that!
We can’t say the probability with which we would expect to see a data point given the data.
We can’t sample from the resulting approximation to \(P(\mathbf X)\) because there isn’t a distribution.
A probabilistic version of PCA (sort of kinda) is called factor analysis
The construction is pretty similar - \(P\) dimensional feature space to \(K\) dimensional latent space, \(\mathbf Z\)
The difference is that we give structure to the latent space by making a prior assumption
Not that consequential since the latent space is to be learned from the data.
Just determines what the latent space will look like.
Most common version - Gaussian factor analysis
\[ f(\mathbf z) = \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) \]
\[ f(\mathbf x | \mathbf z , \boldsymbol \Theta) = \mathcal N_P(\mathbf x | \mathbf W \mathbf z + \boldsymbol \alpha , \boldsymbol \Psi) \]
where \(\mathbf W\) is a \(P \times K\) matrix of weights, \(\mathbf z\) is a \(K\)-vector of latent values, \(\boldsymbol \alpha\) is an intercept term, and \(\boldsymbol \Psi\) is a \(P \times P\) covariance matrix.
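Ancestral sampling from this model is straightforward - draw \(\mathbf z\) from the prior, then draw \(\mathbf x\) given \(\mathbf z\). All parameters below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling from Gaussian factor analysis:
# z ~ N(0, I_K), then x | z ~ N(W z + alpha, Psi).
P, K = 4, 2
W = rng.normal(size=(P, K))      # hypothetical loading matrix
alpha = np.zeros(P)              # intercept
Psi = 0.1 * np.eye(P)            # noise covariance

def sample_x():
    z = rng.normal(size=K)                           # prior draw
    return rng.multivariate_normal(W @ z + alpha, Psi)

X = np.array([sample_x() for _ in range(5000)])

# The implied marginal covariance of x is W W^T + Psi
print(np.cov(X.T))
```

Integrating out \(\mathbf z\) gives \(\mathbf x \sim \mathcal N_P(\boldsymbol \alpha, \mathbf W \mathbf W^T + \boldsymbol \Psi)\), which the sample covariance above approximates - this is the sense in which the model is a proper, sampleable distribution over the feature space.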
What does this give us?
Like PCA, the goal is to find \(\mathbf W\) and \(\mathbf Z\) that minimize the reconstruction error given the training data.
But, we’re going to show that this is equivalent to estimating a Bayesian model
And posterior distributions represent probabilistic distributions over the latent space
i.e., we're properly estimating a low rank approximate distribution over the input features that can be used to generate viable new feature sets!
Then, port this generative model to the autoencoder framework to get the variational autoencoder